Abstract

In a novel approach to the multiple testing problem, Efron (2004; 2007) formulated estimators
of the distribution of test statistics or nominal p-values under a null distribution suitable
for modeling the data of thousands of unaffected genes, non-associated single-nucleotide
polymorphisms, or other biological features. Estimators of the null distribution can improve not
only the empirical Bayes procedure for which they were originally intended, but also many other
multiple comparison procedures. Such estimators serve as the groundwork for the proposed
multiple comparison procedure based on a recent frequentist method of minimizing posterior
expected loss, exemplified with a non-additive loss function designed for genomic screening
rather than for validation.

The merit of estimating the null distribution is examined from the vantage point of condi- 
tional inference in the remainder of the paper. In a simulation study of genome-scale multiple 
testing, conditioning the observed confidence level on the estimated null distribution as an ap- 
proximate ancillary statistic markedly improved conditional inference. To enable researchers to 
determine whether to rely on a particular estimated null distribution for inference or decision 
making, an information-theoretic score is provided that quantifies the benefit of conditioning. 
As the sum of the degree of ancillarity and the degree of inferential relevance, the score reflects 
the balance conditioning would strike between the two conflicting terms. 

Applications to gene expression microarray data illustrate the methods introduced. 

Keywords: ancillarity; attained confidence level; composite hypothesis testing; conditional
inference; empirical null distribution; GWA; multiple comparison procedures; observed confidence
level; simultaneous inference; simultaneous significance testing; SNP
1 Introduction



1.1 Multiple comparison procedures

1.1.1 Aims of multiple-comparison adjustments 

Controversy surrounding whether and how to adjust analysis results for multiple comparisons can 
be partly resolved by recognizing that a procedure that works well for one purpose is often poorly 
suited for another since different types of procedures solve distinct statistical problems. Methods of 
adjustment have been developed to attain three goals, the first two of which optimize some measure 
of sample space performance: 

1. Adjustment for selection. The most common concern leading to multiple-comparison adjustments
stems from the observation that results can achieve nominal statistical significance
because they were selected to do so rather than because of a reproducible effect. Adjustments
of this type are usually based on control of a Type I error rate such as a family-wise error rate
or a false discovery rate as defined by Benjamini and Hochberg (1995). Dudoit et al. (2003)
reviewed several options in the context of gene expression microarray data.



2. Minimization of a risk function. Stein (1956) proved that the maximum likelihood estimator
(MLE) is inadmissible for estimation of a multivariate normal mean under squared error loss,
even in the absence of correlation. Efron and Morris (1973) extended the result by establishing
that the MLE is dominated by a wide class of estimators derived via an empirical Bayes
approach in which the mean is random. More recently, Ghosh (2006) adjusted p-values for
multiple comparisons by minimizing their risk as estimators of a posterior probability. In the
presence of genome-scale numbers of comparisons, adjustments based on hierarchical models
are often much less extreme than those needed to adjust for selection. For two examples
from microarray data analysis, Efron (2008) found that posterior intervals based on a local
false discovery rate (LFDR) estimate tend to be substantially narrower than those needed to
control the false coverage rate introduced by Benjamini et al. (2005) to account for selection,
and an LFDR-based posterior mean has insufficient shrinkage toward the null to adequately
correct selection bias (Bickel 2008a).



3. Estimation of null or alternative distributions. Measurements over thousands of biological
features available from studies of genome-scale expression and genome-wide association studies
have recently enabled estimation of distributions of p-values. Early empirical Bayes methods
of estimating the LFDR associated with each null hypothesis employ estimates of the
distribution of test statistics or p-values under the alternative hypothesis (e.g., Efron et al. 2001).
Efron (2004, 2007a) went further, demonstrating the value of also estimating the distribution
of p-values under the null hypothesis provided a sufficiently large number of hypotheses under
simultaneous consideration.

While all three aims are relevant to Neyman-Pearson testing, they differ as much in their relevance
to Fisherian significance testing as in the procedures they motivate. Mayo and Cox (2006) pointed
out that Type I error rate control is appropriate for making series of decisions but not for inductive
reasoning, where the inferential evaluation of evidence is of concern apart from loss functions that
depend on how that evidence will be used, which, as Fisher (1973, pp. 95-96, 103-106) stressed,
might not even be known at the time of data analysis. Likewise, Hill (1990) and Gleser (1990) found
optimization over the sample space helpful for making series of decisions rather than for drawing
scientific inferences from a particular observed sample. Cox (1958, 2006) noted that selection of
a function to optimize is inherently subjective to the extent that different decision makers have
different interests. Further, sample space optimality is often achieved at the expense of induction
about the parameter given the data at hand; for example, optimal confidence intervals result from
systematically stretching them in samples of low variance and reducing them in samples of high
variance relative to their conditional counterparts (Cox 1958; Barnard 1976; Fraser and Reid 1990;
Fraser 2004a,b).

The suitability of the methods of both of the first two goals for decision rules as opposed to
inductive reasoning is consistent with the observation that control of Type I error rates may be
formulated as a minimax problem (e.g., Lehmann 1950; Wald 1961, §1.5), indicating that the second
of the above aims generalizes the first. Although corrections in order to account for selection are
often applied when it is believed that only a small fraction of null hypotheses are false (Cox 2006),
the methods of controlling a Type I error rate used to make such corrections are framed in terms
of rejection decisions and thus may depend on the number of tests conducted, which would not be
the case were the degree of correction a function only of prior beliefs. By contrast with the first two
aims, the third aim, improved specification of the alternative or null distribution of test statistics, is
clearly as important in significance testing as in fixed-level Neyman-Pearson testing. In short, while
the first two motivations for multiple comparison procedures address decision-theoretic problems,
only the third pertains to significance testing in the sense of impartially weighing evidence without
regard to possible consequences of actions that might be taken as a result of the findings.



1.1.2 Estimating the null distribution 

Because of its novelty and its potential importance for many frequentist procedures of multiple
comparisons, the effect of relying on the following method due to Efron (2004, 2007a, 2007b) of
estimating the null distribution will be examined herein. The method rests on the assumption that
about 90% or more of a large number of p-values correspond to unaffected features and thus have 
a common distribution called the true null distribution. If that distribution is uniform, then the 
assumed null distribution of test statistics with respect to which the p-values were computed is 
correct. 

In order to model the null distribution as a member of the normal family, the p-values are
transformed by $\Phi^{-1} : [0, 1] \to \mathbb{R}^1$, the standard normal quantile function. The
parameters of that distribution are estimated either by fitting a curve to the central region of a
histogram of the transformed p-values (Efron 2004) or, as used below, by applying a maximum
likelihood procedure to a truncated normal distribution (Efron 2007b). The main justification for
both algorithms is that since nearly all p-values are modeled as variates from the true null
distribution and since the remaining p-values are considered drawn from a distribution with wider
tails, the less extreme p-values better resemble the true null distribution than do those that are
more extreme. Since the theoretical null distribution is standard normal in the transformed domain,
deviations from the standard normal distribution reflect departures in the less extreme p-values
from uniformity in the original domain.

For use in multiple testing, all of the transformed p-values of the data set are treated as test 
statistics for the derivation of new p-values with respect to the null distribution estimated as de- 
scribed above instead of the assumed null distribution. Such adjusted p-values would be suitable for 
inductive inference or for decision-theoretic analyses such as those controlling error rates, provided 
that the true null distribution tends to be closer to the estimated null distribution than it is to the 
assumed null distribution. 
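
The two steps just described (estimate a normal null from the less extreme transformed p-values, then recompute every p-value against it) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the median and interquartile range of the transformed p-values stand in for the curve-fitting and truncated-normal maximum likelihood estimators of Efron (2004, 2007b), which are not reproduced here.

```python
import random
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist()  # assumed (theoretical) null in the transformed domain


def estimate_null(p_values):
    """Estimate (mu, sigma) of a normal null from the central z-values.

    Stand-in estimator (an assumption of this sketch): the median and the
    interquartile range replace the truncated-normal maximum likelihood fit.
    """
    z = [STD_NORMAL.inv_cdf(p) for p in p_values if 0.0 < p < 1.0]
    q1, q2, q3 = quantiles(z, n=4)
    iqr_of_std_normal = STD_NORMAL.inv_cdf(0.75) - STD_NORMAL.inv_cdf(0.25)
    return q2, (q3 - q1) / iqr_of_std_normal


def adjust_p_values(p_values, mu, sigma):
    """Recompute p-values (each strictly between 0 and 1) under N(mu, sigma^2)."""
    estimated_null = NormalDist(mu, sigma)
    return [estimated_null.cdf(STD_NORMAL.inv_cdf(p)) for p in p_values]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical data: 95% null features whose true null is wider than N(0, 1).
    z = [rng.gauss(0.0, 1.5) for _ in range(9500)]
    z += [rng.gauss(4.0, 1.0) for _ in range(500)]
    p = [STD_NORMAL.cdf(v) for v in z]
    mu_hat, sigma_hat = estimate_null(p)
    print(f"estimated null: N({mu_hat:.2f}, {sigma_hat:.2f}^2)")
```

Because the quartiles lie in the central region, they are only mildly affected by the small fraction of affected features, which is the same rationale the text gives for trusting the less extreme p-values.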



1.2 Overview 

The next section presents a confidence-based distribution of a vector parameter in order to unify the
present study of null distribution estimation within a single framework. The general framework is
then applied to the problem of estimating the null distribution in Section 3.1. Section 3.2 introduces
a multiple comparisons procedure for coherent decisions made possible by the confidence-based
posterior without recourse to Bayesian or empirical Bayesian models.

Adjusting p-values by the estimated null distribution is interpreted as inference conditional
on that estimate in Section 4. The simulation study of Section 4.1 demonstrates that estimation
of the null distribution can substantially improve conditional inference even when the assumed
null distribution is correct marginal over a precision statistic. Section 4.2 provides a method for
determining whether the estimated null distribution is sufficiently ancillary and relevant for effective
conditional inference or decision making on the basis of a given data set.

Section 5 concludes with a discussion of the new findings and methods.
2 Statistical framework



2.1 Confidence levels as posterior probabilities



The observed data vector $x \in \Omega$ is modeled as a realization of a random quantity $X$ of
distribution $P_\xi$, a probability distribution on the measurable space $(\Omega, \Sigma)$ that is
specified by the full parameter $\xi \in \Xi$. Let $\theta = \theta(\xi)$ denote a parameter of
interest in $\Theta$ and $\gamma = \gamma(\xi)$ a nuisance parameter.



Definition 1. In addition to the above family of probability measures $\{P_\xi : \xi \in \Xi\}$,
consider a family of probability measures $\{P^x : x \in \Omega\}$, each on the space
$(\Theta, \mathcal{A})$, and a set $\mathcal{R}(S) = \{\hat\Theta_{p,s(p)} : p \in [0, 1], s \in S\}$
of region estimators corresponding to a set $S$ of shape functions, where
$\hat\Theta_{p,s(p)} : \Omega \to \mathcal{A}$ for all $p \in [0, 1]$ and $s \in S$. If, for every
$\Theta' \in \mathcal{A}$, $x \in \Omega$, and $\xi \in \Xi$, there exist a coverage rate $p$ and
shape $s(p)$ such that

$$P^x(\Theta') = p = P_\xi\left(\theta(\xi) \in \hat\Theta_{p,s(p)}(X)\right) \qquad (1)$$

and $\hat\Theta_{p,s(p)}(x) = \Theta'$, then the probability $P^x(\Theta')$ is the confidence level
of the hypothesis $\theta(\xi) \in \Theta'$ according to $P^x$, the confidence measure over $\Theta$
corresponding to $\mathcal{R}(S)$.

Remark 2. Unless the $\sigma$-field $\mathcal{A}$ is Borel, the confidence level of the hypothesis
of interest will not necessarily be defined; cf. McCullagh (2004).



Building on work of Efron and Tibshirani (1998) and others, Polansky (2007) used the equivalent
of $P^x$ to concisely communicate a distribution of "observed confidence" or "attained confidence"
levels for each hypothesis that $\theta$ lies in some region $\Theta'$. The decision-theoretic
"certainty" interpretation of $P^x$ as a non-Bayesian posterior (Bickel 2009) serves the same purpose
but also ensures the coherence of actions that minimize expected posterior loss. Robinson (1979) also
considered interpreting the ratio $p/(1 - p)$ from equation (1) as odds for betting on the hypothesis
that $\theta \in \Theta'$.

The posterior distribution need not conform to the Bayes update rule (Bickel 2009) since decisions
that minimize posterior expected loss, or, equivalently, maximize expected utility, are coherent
as long as the posterior distribution is some finitely additive probability distribution over parameter
space (see, e.g., Anscombe and Aumann 1963). It follows that an intelligent agent that acts as if
$p/(1 - p)$ are fair betting odds for the hypothesis that $\theta$ lies in a level-$p$ confidence
region estimated by some region estimator of exact coverage rate $p$ is coherent if and only if its
actions minimize expected loss with the expectation value over a confidence measure as the probability
distribution defining the expectation value (cf. Bickel 2009). Minimizing expected loss over the
parameter space, whether based on a confidence posterior or on a Bayesian posterior, differs
fundamentally from the decision-theoretic approach of Section 1.1 in that the former is optimal given
the single sample actually observed whereas the latter is optimal over repeated sampling. Section 3.2
illustrates the minimization of confidence-measure expected loss with an application to screening on
the basis of genomics data.



2.2 Confidence levels versus p-values

Whether confidence levels agree with p-values depends on the parameter of interest and on the
chosen hypotheses. If $\theta$ is a scalar and the null hypothesis is $\theta = \theta'$, the p-values
associated with the alternative hypotheses $\theta > \theta'$ and $\theta < \theta'$ are
$P^x((-\infty, \theta'))$ and $P^x((\theta', \infty))$, respectively; cf. Schweder and Hjort (2002).



On the other hand, a p-value associated with a two-sided alternative is not typically equal to the
confidence level $P^x(\{\theta'\})$. Polansky (2007, pp. 126-128, 216) discusses the tendency of the
attained confidence level of a point or simple hypothesis such as $\theta = \theta'$ to vanish in a
continuous parameter space. That only a finite number of points in hypothesis space have nonzero
confidence is required of any evidence scale that is fractional in the sense that the total strength
of evidence over $\Theta$ is finite. (Fractional scales enable statements of the form, "the negative,
null, and positive hypotheses are 80%, 15%, and 5% supported by the data, respectively.") While the
usual two-sided p-value vanishes only for sufficiently large samples, the confidence level
$P^x(\{\theta'\})$ typically is 0% even for the smallest samples and thus does not lead to the
appearance of a paradox of "over-powered" studies. As a remedy, Hodges and Lehmann (1954) proposed
testing an interval hypothesis $\theta \in \Theta'$ defined in terms of scientific significance; in
this situation, as with composite hypothesis-testing in general, $P^x(\Theta')$ converges in
probability to $1_{\Theta'}(\theta)$ even though the two-sided p-value does not (Bickel 2009).
(Testing a simple null hypothesis against a composite alternative hypothesis yields a similar
discrepancy between a two-sided p-value and methods that respect the likelihood principle
(Levine; Bickel 2008b).)

There are nonetheless situations that, when using p-values for statistical significance, necessitate
testing a hypothesis known to be false for all practical purposes. Cox (1977) called a null hypothesis
$\theta = \theta'$ dividing if it is not considered because it could possibly be approximately true
but rather because it lies on the boundary between $\theta < \theta'$ and $\theta > \theta'$, the two
hypotheses of genuine interest. For example, a test of equality of means and its associated two-sided
p-value often serve the purpose of determining whether there are enough data to determine the
direction of the difference when it is known that there is some appreciable difference (Cox 1977).
That goal can be more directly attained by comparing the confidence levels $P^x((-\infty, \theta'))$
and $P^x((\theta', \infty))$. While reporting the ratio or maximum of $P^x((-\infty, \theta'))$ and
$P^x((\theta', \infty))$ would summarize the confidence level of each of two regions in a single
number, such a number may be more susceptible to misinterpretation than a report of the pair of
confidence levels.
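
A minimal sketch of reporting the pair of directional confidence levels for a dividing null hypothesis is given below. It assumes independent normal observations with a known standard deviation, a simplification made here for illustration only; the function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def directional_confidence(sample, theta0, sigma):
    """Return (confidence of theta < theta0, confidence of theta > theta0).

    Assumes independent normal observations with known standard deviation
    sigma (an assumption of this sketch, not of the paper).
    """
    standard_error = sigma / sqrt(len(sample))
    below = NormalDist().cdf((theta0 - mean(sample)) / standard_error)
    return below, 1.0 - below


if __name__ == "__main__":
    log_ratios = [0.8, 1.1, 0.4, 1.3, 0.9, 0.7]  # hypothetical log expression ratios
    below, above = directional_confidence(log_ratios, theta0=0.0, sigma=1.0)
    print(f"confidence(theta < 0) = {below:.4f}, confidence(theta > 0) = {above:.4f}")
```

The first returned value equals the one-sided p-value against the alternative that the parameter exceeds `theta0`, and the two values sum to one because the point hypothesis itself has zero confidence in this continuous setting.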



2.3 Simultaneous inference 

In the typical genome-scale problem, there are $d$ scalar parameters
$\theta_1, \theta_2, \ldots, \theta_d$ and $d$ corresponding observables $X_1, X_2, \ldots, X_d$, such
that $d > 1000$ and $\theta_i = \theta_i(\xi)$ is a subparameter of the distribution of $X_i$, the
random quantity of which the observation $x_i \in \Omega_i$ is a realized vector. The $i$th of the $d$
hypotheses to be simultaneously tested is $\theta_i \in \Theta'_i$ for some $\Theta'_i$ in
$\Theta_i$, a subset of $\mathbb{R}^1$. Representing numeric tuples under the angular bracket
convention to distinguish the open interval $(x, y)$ from the ordered pair $\langle x, y \rangle$,
$\theta = \theta(\xi) = \langle \theta_1, \theta_2, \ldots, \theta_d \rangle$ is the $d$-dimensional
subparameter of interest and the joint hypothesis is $\theta(\xi) \in \Theta'$, where
$\Theta' = \Theta'_1 \times \Theta'_2 \times \cdots \times \Theta'_d$.

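For any $\delta \in \{1, 2, \ldots, d - 1\}$, inference may focus on $\delta$ of the scalar parameters
as opposed to the entire vector $\theta$. For example, separate consideration of the confidence levels
of hypotheses such as $\theta_1 \in \Theta'_1$ or of
$\langle \theta_1, \theta_2 \rangle \in \Theta'_1 \times \Theta'_2$ can be informative, especially if
$d$ is high. Each of the components of the focus index
$\iota = \langle i(1), i(2), \ldots, i(\delta) \rangle$ is in $\{1, \ldots, d\}$ and is unequal to each
of its other components. The proper subset
$\Theta'_\iota = \Theta'_{i(1)} \times \Theta'_{i(2)} \times \cdots \times \Theta'_{i(\delta)}$ of
$\Theta_\iota = \Theta_{i(1)} \times \Theta_{i(2)} \times \cdots \times \Theta_{i(\delta)}$ is defined
in order to weigh the evidence for the hypothesis that
$\theta_\iota = \langle \theta_{i(1)}, \theta_{i(2)}, \ldots, \theta_{i(\delta)} \rangle \in \Theta'_\iota$.
Setting $\Theta' = \Theta'_1 \times \Theta'_2 \times \cdots \times \Theta'_d$ such that
$\Theta'_j = \Theta_j$ for all $j \notin \{i(1), i(2), \ldots, i(\delta)\}$, define the marginal
distribution $P^x_\iota$ such that $P^x_\iota(\Theta'_\iota)$ is equal to the confidence level
$P^x(\Theta')$. Thus, $P^x_\iota$ is a probability measure marginal over all $\theta_j$ with
$j \notin \{i(1), i(2), \ldots, i(\delta)\}$.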

The following lemma streamlines inference focused on whether $\theta_\iota \in \Theta'_\iota$ or,
equivalently, $\theta(\xi) \in \Theta'$, by establishing sufficient conditions for the confidence level
marginal over some of the $d$ components of $\theta$ to be equal to the parameter coverage probability
marginal over the data corresponding to those components.

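Lemma 3. Considering a focus index $\iota$ and
$X_\iota = \langle X_{i(1)}, X_{i(2)}, \ldots, X_{i(\delta)} \rangle$, let
$\hat\Theta^\iota_{p,s(p)} : \Omega \to \mathcal{A}_\iota$ be the corresponding level-$p$ set estimator
of some shape parameter $s(p)$ defined such that for every $x \in \Omega$,
$\hat\Theta^\iota_{p,s(p)}(x)$ is the canonical projection of $\hat\Theta_{p,s(p)}(x)$ from
$\mathcal{A}$ to $\mathcal{A}_\iota$, the $\sigma$-field of the marginal distribution $P^x_\iota$. If
there is a map $\tilde\Theta^\iota_{p,s(p)} : \Omega_\iota \to \mathcal{A}_\iota$ such that
$\tilde\Theta^\iota_{p,s(p)}(X_\iota)$ and $\hat\Theta^\iota_{p,s(p)}(X)$ are identically distributed,
then $P^x_\iota$ is the confidence measure over $\Theta_\iota$ corresponding to
$\{\tilde\Theta^\iota_{p,s(p)} : p \in [0, 1], s \in S\}$.

Proof. By the general definition of confidence level (1),

$$P^x_\iota(\Theta'_\iota) = P^x(\Theta') = P_\xi\left(\theta \in \hat\Theta_{p,s(p)}(X)\right),$$

where the coverage rate $p$ and shape parameter $s(p)$ are constrained such that
$\hat\Theta_{p,s(p)}(x) = \Theta'$ for the observed value $x$ of random element $X$. Hence, using $A_j$
to denote the event that $\theta_j \in \Theta'_j$,

$$P^x_\iota(\Theta'_\iota) = P_\xi\left(\theta_\iota \in \hat\Theta^\iota_{p,s(p)}(X), A_j\right) \qquad (2)$$

with the coverage rate $p$ and shape parameter $s(p)$ restricted such that
$\hat\Theta^\iota_{p,s(p)}(x) = \Theta'_\iota$. Considering
$j \notin \{i(1), i(2), \ldots, i(\delta)\}$, the event satisfies $P_\xi(A_j) = 1$ since
$\Theta'_j = \Theta_j$, thereby eliminating $A_j$ from equation (2). Because
$\tilde\Theta^\iota_{p,s(p)}$ exists by assumption,
$\tilde\Theta^\iota_{p,s(p)}(x_\iota) = \Theta'_\iota$ results and
$\tilde\Theta^\iota_{p,s(p)}(X_\iota)$ replaces $\hat\Theta^\iota_{p,s(p)}(X)$ in equation (2) since
they are identically distributed. Therefore,

$$P^x_\iota(\Theta'_\iota) = p = P_\xi\left(\theta_\iota \in \tilde\Theta^\iota_{p,s(p)}(X_\iota)\right),$$

where the coverage rate $p$ and shape parameter $s(p)$ are constrained such that
$\tilde\Theta^\iota_{p,s(p)}(x_\iota) = \Theta'_\iota$ for the observed value
$x_\iota = \langle x_{i(1)}, x_{i(2)}, \ldots, x_{i(\delta)} \rangle$ of $X_\iota$. □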

Conditional independence is sufficient to satisfy the lemma's condition of identically distributed
region estimators:

Theorem 4. If $X_i$ is conditionally independent of $X_j$ and $\theta_j$ given $\theta_i$ for all
$i \neq j$, then, for any focus index $\iota$, there is a map
$\tilde\Theta^\iota_{p,s(p)} : \Omega_\iota \to \mathcal{A}_\iota$ such that
$\tilde\Theta^\iota_{p,s(p)}(x_\iota) = \hat\Theta^\iota_{p,s(p)}(x)$ with
$x_\iota = \langle x_{i(1)}, x_{i(2)}, \ldots, x_{i(\delta)} \rangle$ for every $x \in \Omega$, and the
marginal distribution $P^x_\iota$ is the confidence measure over $\Theta_\iota$ corresponding to
$\{\tilde\Theta^\iota_{p,s(p)} : p \in [0, 1], s \in S\}$.

Proof. By the conditional independence assumption, $\hat\Theta^\iota_{p,s(p)}(X)$ is conditionally
independent of $\theta_j$ and $X_j$ for all $j \notin \{i(1), i(2), \ldots, i(\delta)\}$ given
$\theta_\iota$, entailing the existence of a map
$\tilde\Theta^\iota_{p,s(p)} : \Omega_\iota \to \mathcal{A}_\iota$ such that
$\tilde\Theta^\iota_{p,s(p)}(X_\iota)$ and $\hat\Theta^\iota_{p,s(p)}(X)$ are identically distributed.
Then the above lemma yields the consequent. □

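The theorem facilitates inference separately focused on each scalar subparameter $\theta_i$ on the
basis of the observation that $X_i = x_i \in \Omega_i$.

Corollary 5. If $X_i$ is conditionally independent of $X_j$ and $\theta_j$ given $\theta_i$ for all
$i \neq j$, then, for any $i \in \{1, 2, \ldots, d\}$, the marginal distribution
$P^x_{\langle i \rangle}$ is the confidence measure over $\Theta_i$ corresponding to some set
$\{\tilde\Theta^{\langle i \rangle}_{p,s(p)} : p \in [0, 1], s \in S\}$ of interval estimators, each a
map $\tilde\Theta^{\langle i \rangle}_{p,s(p)} : \Omega_i \to \mathcal{A}_{\langle i \rangle}$.

Proof. Under the stated conditions, the theorem entails the existence of a map
$\tilde\Theta^{\langle i \rangle}_{p,s(p)} : \Omega_{\langle i \rangle} \to \mathcal{A}_{\langle i \rangle}$
such that $\tilde\Theta^{\langle i \rangle}_{p,s(p)}(x_{\langle i \rangle}) = \hat\Theta^{\langle i \rangle}_{p,s(p)}(x)$
with $x_{\langle i \rangle} = x_i$ for every $x \in \Omega$ and entails that the marginal distribution
$P^x_{\langle i \rangle}$ is the confidence measure over $\Theta_{\langle i \rangle}$ corresponding to
$\{\tilde\Theta^{\langle i \rangle}_{p,s(p)} : p \in [0, 1], s \in S\}$. □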

Remark 6. The applications of Sections 3 and 4 exploit this property in order to draw inferences from
the confidence levels $P^x_{\langle 1 \rangle}((\inf \Theta_1, \theta'))$,
$P^x_{\langle 2 \rangle}((\inf \Theta_2, \theta'))$, ...,
$P^x_{\langle d \rangle}((\inf \Theta_d, \theta'))$ of the hypotheses $\theta_1 < \theta'$,
$\theta_2 < \theta'$, ..., $\theta_d < \theta'$, respectively, for very large $d$. Here, $\delta = 1$,
each subscript $\langle j \rangle$ is the 1-tuple representation of the vector $\iota$ with $j$ as its
only component, and $\theta'$ is the scalar supremum shared by all $d$ hypotheses.

3 Null estimation for genome-scale screening 
3.1 Estimation of the null posterior 

In the presence of hundreds or thousands of hypotheses, the novel methodology of Efron (2007a) can
improve evidential inference by estimation of the null distribution. While Efron (2007a) originally
applied the estimator to effectively condition the LFDR on an estimated distribution of null
p-values, he noted that its applications potentially encompass any procedure that depends on the
distribution of statistics under the null hypothesis. Indeed, the stochasticity of parameters that
enables estimation of the LFDR by the empirical Bayes machinery need not be assumed for the
pre-decision purpose of deriving the level of confidence that each gene is differentially expressed.
Thus, the methodology of Efron (2007a) outlined in Section 1.1.2 in terms of p-values can be
appropriated to adjust confidence levels since $P^x_{\langle i \rangle}((-\infty, \theta'))$, the
level of confidence that a scalar subparameter $\theta_i$ is less than a given scalar $\theta'$, is
numerically equal to $p^x_{\langle i \rangle}(\theta')$, the upper-tailed p-value for the test of the
hypothesis that $\theta_i = \theta'$. Specifically, confidence levels are adjusted in this paper
according to the estimated confidence measure under the null hypothesis rather than according to an
assumed confidence measure under the null hypothesis.

Treating the parameters indicating differential expression as fixed rather than as exchangeable
random quantities arguably provides a closer fit to the biological system in the sense that certain
genes remain differentially expressed and other genes remain by comparison equivalently expressed
across controlled conditions under repeated sampling. While the confidence measure is a probability
measure on parameter space, its probabilities are interpreted as degrees of confidence suitable for
coherent decision making (§3.2), not as physical probabilities modeling a frequency of events in the
system. The interpretation of parameter randomness in LFDR methods is less clear except when
the LFDR is seen as an approximation to a Bayesian posterior probability under a hierarchical
model.



Example 7. A tomato development experiment of Alba et al. (2005) yielded n = 6 observed ratios
of mutant expression to wild-type expression in most of the d = 13,340 genes on the microarray with
missing data for many genes. For the $i$th gene, the interest parameter $\theta_i$ is the expectation
value of $X_i$, the logarithm of the expression ratio. The hypothesis $\theta_i < 0$ corresponds to
downregulation of gene $i$ in the mutant, whereas $\theta_i > 0$ corresponds to upregulation. To
obviate estimation of a joint distribution of $d$ parameters, the independence conditions of
Corollary 5 are assumed to hold. Also assuming normally distributed $X_i$, the one-sample t-test gave
the upper-tail p-value equal to the confidence level
$P^{x_i}_{\langle i \rangle}((-\infty, 0))$ for each gene. The notation is that of Remark 6 except
with each $x$ replaced by $x_i$ to emphasize that only the $i$th observed vector influences the
confidence level corresponding to the $i$th parameter. Efron's (2007b) maximum-likelihood method
of estimating the null distribution from a vector of p-values provided the estimated null confidence
measure that is very close to the empirical distribution of the data (Fig. 1), which is consistent
with but does not imply the truth of all null hypotheses of equivalent expression ($\theta_i = 0$).
Using that estimate of the null distribution in place of the uniform distribution corresponding to the
Student t distribution of test statistics has the effect of adjusting each confidence level. Since
extreme confidence levels are adjusted toward 1/2, the estimated null reduces the confidence level
both of genes with large values of $P^{x_i}_{\langle i \rangle}((-\infty, 0))$ (confidence of the
hypothesis $\theta_i < 0$) and of those with large values of
$P^{x_i}_{\langle i \rangle}((0, \infty))$ (confidence of the hypothesis $\theta_i > 0$). Fig. 2
displays the effect of this confidence-level adjustment in more detail.
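
As a worked illustration of the adjustment toward 1/2, suppose the estimated null in the transformed domain is the normal distribution reported in the caption of Fig. 1, taken here to have location 0.21 and scale 1.55 (the positive sign of the location is an assumption of this illustration). Nominal confidence levels of 99% and 1% then become roughly 91% and 5%:

```latex
\begin{align*}
\Phi^{-1}(0.99) &\approx 2.326, &
\frac{2.326 - 0.21}{1.55} &\approx 1.365, &
\Phi(1.365) &\approx 0.914,\\
\Phi^{-1}(0.01) &\approx -2.326, &
\frac{-2.326 - 0.21}{1.55} &\approx -1.636, &
\Phi(-1.636) &\approx 0.051.
\end{align*}
```

Both adjusted levels are less extreme than the nominal ones because the estimated scale exceeds 1.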



3.2 Genome-scale screening loss 



Carlin and Louis (2000, §B.5.2) observed that with a suitable non-additive loss function, optimal
decisions in the presence of multiple comparisons can be made on the basis of minimizing posterior
expected loss. A simple non-additive loss function is

$$L_{a,c}(M, m) = c M^{1+a} + m, \qquad (3)$$

where $M$ and $m$ are respectively the number of incorrect decisions and the number of non-decisions
concerning the $d$ components of $\theta$; $M + m \le d$. The scalars $a \ge 0$ and $c > 0$ reflect
different aspects of risk aversion: $a$ is an acceleration in the sense of quantifying the interactive
compounding effect of multiple errors, whereas if $a = 0$, then $c$ is the ratio of the cost of making
an incorrect decision to the cost of not making any decision or, equivalently, the benefit of making a
correct decision.

[Figure 1 plot: estimated certainty level CDFs versus nominal certainty level]

Figure 1: The black curve is the estimated cumulative distribution function (CDF) of the confidence
levels under the null distribution, which corresponds to equivalently expressed or unaffected
genes; the gray curve is the empirical CDF of all confidence levels, including those of differentially
expressed or affected genes. Here, observed confidence coefficients corresponding to hypotheses are
interpreted as levels of certainty (§§2.1, 3.2). Departure of the black curve from the diagonal line
reflects violation of independence or of the lognormal assumption used to compute the confidence
levels. As one-sided p-values, these confidence levels would be uniform under the hypothesis of
equivalent expression given the assumptions; i.e., the $\Phi^{-1}$-transformed confidence levels of
unaffected genes are assumed to be $N(0, 1^2)$, where $\Phi^{-1}$ is the standard normal quantile
function. The distribution of $\Phi^{-1}$-transformed confidence levels under that null hypothesis was
estimated to instead be $N(0.21, (1.55)^2)$. The data set, model, and null distribution estimator are
those of Example 7.

[Figure 2 plots. Left panel title: all 12726 genes with 0 < p < 1. Right panel title: 1268 genes
adjusted to insignificance]

Figure 2: Impact of null estimation on the confidence level as the measure of certainty or
statistical significance. The data set, model, and null distribution estimator are those of Example
7 and Fig. 1. Left panel: The transformed confidence level
$\Phi^{-1}(P^{x_i}_{\langle i \rangle}(\mathbb{R}_-))$ of gene $i$ versus the expression ratio
estimated as the geometric sample mean of the observed expression ratio for the same gene. Here, the
confidence level $P^{x_i}_{\langle i \rangle}(\mathbb{R}_-)$ is the degree of certainty of the
hypothesis that the mean log-transformed expression ratio is negative or, equivalently, of the
hypothesis that the true expression ratio is less than 1. The horizontal lines are drawn at
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_-) = 99\%$ and at
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_+) = 1 - P^{x_i}_{\langle i \rangle}(\mathbb{R}_-) = 99\%$.
Of the original 13,340 genes, 1062 genes have fewer than the two observations needed for the test
statistic and 2 genes have infinite normal-transformed confidence levels and thus are not displayed.
Each circle corresponds to a gene, with black for
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; \hat F_0)$, the confidence level of
$\theta_i \in \mathbb{R}_-$ using the estimated null distribution $\hat F_0$, and with gray for
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; F_0)$, the same except using the assumed null distribution
$F_0$. Right panel: The difference between $P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; \hat F_0)$ and
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; F_0)$ versus the estimated expression ratio. Orange circles
represent genes satisfying $P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; F_0) > 99\%$ but
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_-; \hat F_0) < 99\%$; black circles represent genes satisfying
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_+; F_0) > 99\%$ but
$P^{x_i}_{\langle i \rangle}(\mathbb{R}_+; \hat F_0) < 99\%$.

Figure 3: Number $d - m$ of decisions on whether the $i$th gene is overexpressed ($\theta_i > 0$) or
underexpressed ($\theta_i < 0$) plotted against $1 + a$, where $a$ is the degree to which the loss per
incorrect decision increases with the number of incorrect decisions (3). The sign call or decision on
the direction of regulation for each gene was either made or not made such that the following Monte
Carlo approximation to the expected loss
$\langle L_{a,c}(M, m) \rangle = \int L_{a,c}(M, m)\, dP^x$ was minimized based alternately on the
assumed null distribution $F_0$ and on the estimated null distribution $\hat F_0$. The $k$th of the
$10^4$ values of $\theta_i$ was drawn from the frequentist posterior (1) independently for each gene
$i$ to compute the correct sign decisions according to the $k$th realization; such correct decisions
yielded $M_k$ and $m_k$, the number of incorrect sign decisions and the number of non-decisions. The
independence of $\sigma$-fields corresponding to each gene's scalar component of $\theta$ guaranteed
by Corollary 5 implies
$\langle L_{a,c}(M, m) \rangle = 10^{-4} \sum_{k=1}^{10^4} L_{a,c}(M_k, m_k)$. The data set, model,
and null distribution estimator are those of Example 7 and Figs. 1 and 2.



Bickel (2004) and Müller et al. (2004) applied additive loss ($a = 0$) to decisions of whether
or not a biological feature is affected. That special case, however, does not accurately represent
the screening purpose of most genome-scale studies, which is to formulate a reasonable number of
hypotheses about features for confirmation in a follow-up experiment. More suitable for that goal,
$a > 0$ allows generation of hypotheses for at least a few features even on slight evidence without
leading to unmanageably high numbers of features even in the presence of decisive evidence.

Fig. 3 displays the result of minimizing such an expected loss with respect to the confidence
posterior (1) under the above class of loss functions (3) for decisions on the direction of
differential gene expression (Example 7). (Taking the expectation value over the confidence measure
rather than over a Bayesian posterior measure was justified in Section 2.1.)



4 Null estimation as conditional inference 
4.1 Simulation study 

To record the effect of null distribution estimation on inductive inference, a simulation study was
conducted with $K = 500$ independent samples each of $d = 10{,}000$ independent observable vectors,
of which 95% correspond to unaffected and 5% to affected features such as genes or single-nucleotide
polymorphisms (SNPs). In Example 7, an affected gene is one for which there is differential gene
expression between mutant and wild type. Assuming that each scalar parameter $\theta_i$ is constrained
to lie in the same set $\Theta_i$, the one-sided p-value of each observable is equal to
$P^{k,i}((\inf \Theta_i, \theta'))$, the $k$th confidence level of $\theta_i < \theta'$, the hypothesis
that the parameter of interest for the $i$th observable vector or feature is less than some value
$\theta'$ dividing two meaningful hypotheses, as discussed in Section 2.2 and illustrated in Fig. 5.
(This notation differs from that of Remark 6 in adapting the superscript of the confidence level and
from that of Example 7 in dropping the subscript of $X_{k,i}$ for ease of reading.) As
$\theta_i = \theta'$ is treated as a null hypothesis for the purpose of estimating or assuming the null
distribution, it naturally corresponds to an unaffected feature. Each confidence
level was generated from $\Phi$, the standard normal CDF, of $z_{k,i} \sim N(0, \varsigma_k^2)$ for
$i \in \{1, \ldots, 9500\}$ or of $z_{k,i} \sim N(5\varsigma_k/2, (5\varsigma_k/4)^2)$ for
$i \in \{9501, \ldots, 10^4\}$. Rather than fixing $\varsigma_k$ at 1 for all $k$ (Efron 2007a,
Fig. 5), $\varsigma_k$ was instead allowed to vary across samples in order to model sample-specific
variation that influences the distribution of p-values. For every $k$ in $\{1, \ldots, K\}$,
$\varsigma_k$ is independent and equal to 2/3 with probability 30%, 1 with probability 40%, or 3/2
with probability 30%. Each simulated sample was analyzed with the same maximum-likelihood method of
estimating the null distribution used in the above gene expression example, in which the realized
value of $\varsigma_k$ was predicted to be about 3/2 (Fig. 1).
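
The data-generating step of the simulation described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name `simulate_sample`, its
parameters, and the use of the standard library's `random` and `statistics.NormalDist` are choices
made here, not taken from the paper, and the null-estimation step itself is not reproduced.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Phi, the standard normal distribution


def simulate_sample(d=10_000, frac_affected=0.05, rng=None):
    """Draw one simulated sample: (sigma_k, z-values, confidence levels).

    Hypothetical helper; names and defaults are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    # Sample-specific scale: 2/3, 1, or 3/2 with probabilities 30%, 40%, 30%.
    sigma_k = rng.choices([2 / 3, 1.0, 3 / 2], weights=[0.3, 0.4, 0.3])[0]
    n_affected = round(d * frac_affected)
    n_unaffected = d - n_affected
    # Unaffected features: z ~ N(0, sigma_k^2).
    z = [rng.gauss(0.0, sigma_k) for _ in range(n_unaffected)]
    # Affected features: z ~ N(5*sigma_k/2, (5*sigma_k/4)^2).
    z += [rng.gauss(2.5 * sigma_k, 1.25 * sigma_k) for _ in range(n_affected)]
    # Confidence levels as generated in the text: Phi applied to each z-value.
    levels = [STD_NORMAL.cdf(v) for v in z]
    return sigma_k, z, levels
```

Calling `simulate_sample()` 500 times with independent random states would mimic the $K = 500$
samples; a smaller `d` is convenient for quick checks.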

Because $\varsigma_k$ is an ancillary statistic in the sense that its distribution is not a function
of the parameter and since estimation of the null distribution approximates conditioning the p-values
and equivalent confidence levels on the estimated value of $\varsigma_k$, null estimation is required
by the conditionality principle (Cox 1958), in agreement with the analogy with conditioning on
observed row or column totals in contingency tables (Efron 2007a). See Shi (2008) for further
explanation of the relevance of the principle to estimation of the null distribution.

Accordingly, performance of each method of computing confidence levels, whether under the
assumed null distribution $F_0$ or estimated null distribution $\hat{F}_0$, was evaluated in terms of
the proximity of $P^{k,i}((\inf \Theta_i, \theta'); F_0')$, the confidence level of
$\theta_i < \theta'$ for trial $k$ and feature $i$ based on the null hypothesis of distribution
$F_0' \in \{F_0, \hat{F}_0\}$, to $P^{k,i}((\inf \Theta_i, \theta') \mid \varsigma_k = \sigma_k)$, the
corresponding true confidence level conditional on the realized value $\sigma_k$ of $\varsigma_k$ used
to generate the simulated data of trial $k$. For some $\alpha \in [0, 1]$, the conservative error of
relying on $F_0'$ as the distribution under the null hypothesis for the $k$th trial is the average
difference in the number of confidence levels incorrectly included in
$\mathcal{B} = [\alpha, 1 - \alpha]$ and the number incorrectly included in
$\bar{\mathcal{B}} = [0, 1] \setminus \mathcal{B}$:

\[
\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \left[
1_{\mathcal{B}}\left(P^{k,i}(\Theta'; F_0')\right) 1_{\bar{\mathcal{B}}}\left(P^{k,i}(\Theta' \mid \varsigma_k = \sigma_k)\right)
- 1_{\mathcal{B}}\left(P^{k,i}(\Theta' \mid \varsigma_k = \sigma_k)\right) 1_{\bar{\mathcal{B}}}\left(P^{k,i}(\Theta'; F_0')\right)
\right], \tag{4}
\]

where $\Theta' = (\inf \Theta_i, \theta')$ and where $\mathcal{I} = \{1, \ldots, 9500\}$ for the
unaffected features or $\mathcal{I} = \{9501, \ldots, 10^4\}$ for the affected features. Here,
$\alpha = 1\%$ to quantify performance near confidence values relevant to the inference problem of
interpreting the value of $P^{k,i}((\inf \Theta_i, \theta'); F_0')$ as a degree of evidential
support for $\theta_i < \theta'$. Values of the conservatism (4) for the simulation study described
above appear in Fig. 4.
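
As a rough illustration of the conservative error just defined, the following Python sketch counts
confidence levels wrongly placed inside and outside $[\alpha, 1 - \alpha]$. The function name
`conservative_error` and the toy z-values are hypothetical. The example also assumes that, for an
unaffected feature, the level under the assumed null is $\Phi(z)$ and the level conditional on
$\varsigma_k = \sigma_k$ is $\Phi(z/\sigma_k)$, which follows from the generating model but is not
spelled out in this form in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conservative_error(levels_null, levels_true, alpha=0.01):
    """Average of (wrongly inside B) minus (wrongly outside B), B = [alpha, 1 - alpha]."""
    def in_b(p):
        return alpha <= p <= 1.0 - alpha

    wrongly_inside = 0   # level under the working null is in B, true conditional level is not
    wrongly_outside = 0  # level under the working null is outside B, true conditional level is in B
    for p_null, p_true in zip(levels_null, levels_true):
        if in_b(p_null) and not in_b(p_true):
            wrongly_inside += 1
        elif not in_b(p_null) and in_b(p_true):
            wrongly_outside += 1
    return (wrongly_inside - wrongly_outside) / len(levels_null)


# Toy illustration (hypothetical z-values) for unaffected features with sigma_k = 3/2.
z_values = [-4.0, -2.5, 0.3, 2.4, 3.0]
sigma_k = 1.5
assumed = [STD_NORMAL.cdf(z) for z in z_values]                # assumed null N(0, 1)
conditional = [STD_NORMAL.cdf(z / sigma_k) for z in z_values]  # conditional on sigma_k
error = conservative_error(assumed, conditional)
```

A positive value indicates conservatism (too many levels inside the interval); a negative value, as
in this toy case where the assumed null is too narrow, indicates the opposite kind of error.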

To determine the effect of analyzing confidence levels that are valid marginal (unconditional)
p-values for the mixture distribution, the confidence levels valid given $\varsigma_k = 1$ were
transformed such that those corresponding to unaffected features are tail-area probabilities under
the marginal null distribution:

\[
P_{\theta'}(Z_{k,i} < z_{k,i}) = \sum_{\sigma \in \{2/3, 1, 3/2\}} P(\varsigma_k = \sigma) \,
P_{\theta'}(Z_{k,i} < z_{k,i} \mid \varsigma_k = \sigma),
\]

where $\Phi(z_{k,i})$ or $P_{\theta'}(Z_{k,i} < z_{k,i})$ is the observed confidence level of
$\theta_{k,i} < \theta'$ before or after transformation, respectively. Fig. 5 displays the results.
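
The marginal transformation above is a three-component mixture of normal CDFs and can be sketched as
follows. The name `marginal_level` is hypothetical, and the sketch assumes that under the null
hypothesis $Z_{k,i}$ given $\varsigma_k = \sigma$ is $N(0, \sigma^2)$, so that each conditional
probability equals $\Phi(z/\sigma)$.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Mixing probabilities of the scale, as stated in the simulation design.
SCALE_PROBS = {2 / 3: 0.3, 1.0: 0.4, 3 / 2: 0.3}


def marginal_level(z):
    """Tail-area probability P(Z < z) under the marginal (mixture) null distribution."""
    return sum(prob * STD_NORMAL.cdf(z / scale) for scale, prob in SCALE_PROBS.items())
```

For instance, `marginal_level(-3.0)` exceeds `STD_NORMAL.cdf(-3.0)`, reflecting the heavier tails of
the marginal null distribution relative to $N(0, 1)$.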



4.2 Merit of estimating the null distribution 

[Figure 4 panels: histograms of the conservatism of certainty level. Top row (estimated null): 5%
affected features, abs.: 2% (0.75%, 1.8%, 2.8%); 95% unaffected features, abs.: 0.23% (0.042%,
0.18%, 0.33%). Bottom row (assumed null): 5% affected features, abs.: 17% (-21%, 0%, 31%); 95%
unaffected features, abs.: 4% (-9.9%, 0%, 1.7%).]

Figure 4: Conservative error (4) when the assumed null distribution is equal to the true null
distribution conditional on the most common value of the precision statistic ($\varsigma_k = 1$). The
null distribution $F_0'$ is the estimated distribution $\hat{F}_0$ in the top two plots and the
assumed distribution $F_0$ in the bottom two plots. The two plots on the left and right give the
errors averaged over the 500 false and the 9500 true null hypotheses, respectively.

[Figure 5 panels: histograms of the conservatism of certainty level. Top row (estimated null): 5%
affected features, abs.: 27% (20%, 29%, 32%); 95% unaffected features, abs.: 1.8% (1.7%, 1.8%,
1.9%). Bottom row (assumed null): 5% affected features, abs.: 24% (-13%, 15%, 45%); 95% unaffected
features, abs.: 2.5% (-3.9%, 1.5%, 1.8%).]

Figure 5: Conservative error (4) when the assumed null distribution is equal to the true null
distribution marginal over the distribution of precision statistic $\varsigma_k$. The four plots have
the same arrangement as those of Fig. 4.

While the degree of undesirable conservatism illustrates the potential benefit of null estimation
(§4.1), it does not provide case-specific guidance on whether to estimate the null distribution for
a given data set generated by an unknown distribution. Framing the estimated null distribution
as a conditioning statistic makes such guidance available from an adaptation of a general measure
(Lloyd 1992) that quantifies the benefit of conditioning inference on a given statistic. Since an
approximately ancillary statistic can be much more relevant for inference than an exactly ancillary
statistic, Lloyd (1992) quantified the benefit of conditioning on a statistic by the sum of its
degree of ancillarity and its degree of relevance, each degree defined in terms of observed Fisher
information. To assess the benefit of conditioning inference on the estimated null distribution, the
ancillarity and relevance are instead measured in terms of some nonnegative divergence or relative
information $I(F \| G)$ between distributions $F$ and $G$ as follows. The ancillarity of the estimated
distribution $\hat{F}_0$ for $d_1$ affected features is the extent to which the parameter of interest
is independent of the estimate:



\[
A(d_1) = -I\left(\hat{F}_0^{(d_1)} \,\middle\|\, \hat{F}_0\right). \tag{5}
\]

Here, $\hat{F}_0^{(d_1)}$ represents the estimated null distribution with its $d_1$ affected features
replaced with unaffected features. More precisely, $\hat{F}_0^{(d_1)}$ is the estimate of the null
distribution obtained by replacing each of the $d_1$ confidence levels farthest from 0.5 with
$(r - 1/2)/d$, the expected order statistic under the assumed null distribution, where $r$ is the
rank of the distance of the replaced confidence level from 0.5. Exact ancillarity, $A(d_1) = 0$, thus
results only when $\hat{F}_0^{(d_1)} = \hat{F}_0$, which holds approximately for all $d_1$ if
$\hat{F}_0$ is close to the assumed null distribution. Conditioning on a null distribution estimate
is effective to the extent that its relevance,

\[
R = I\left(\hat{F}_0 \,\middle\|\, F_0\right), \tag{6}
\]

is higher than its nonancillarity, $I(\hat{F}_0^{(d_1)} \| \hat{F}_0)$.

The importance of tail probabilities in statistical inference calls for a measure of divergence
$I(F \| G)$ between distributions $F$ and $G$ with more tail dependence than the Kullback-Leibler
divergence. The Rényi divergence $I_q(F \| G)$ of order $q \in (0, 1)$ satisfies this requirement, and
$I_{1/2}(F \| G)$ has proved effective in signal processing as a compromise between the divergence
with the most extreme dependence on improbable events ($\lim_{q \to 0} I_q(F \| G)$) and the
Kullback-Leibler divergence ($\lim_{q \to 1} I_q(F \| G)$). Another advantage of $q = 1/2$ is that
the commutativity property $I_q(F \| G) = I_q(G \| F)$ holds only for that order. The notation
presents $I_q(F \| G)$ as the order-$q$ information gained by replacing $G$ with $F$ (Rényi 1970,
§9.8). Since the random variables of the assumed and estimated null distributions are p-values or
confidence levels transformed by $\Phi^{-1}$ (Fig. 1) and since both distributions are normal, the
relative information of order 1/2 is simply

\[
I_{1/2}(F \| G) = -2 \log_2 \left[ \sqrt{\frac{2 \sigma_F \sigma_G}{\sigma_F^2 + \sigma_G^2}}
\exp\left( -\frac{(\mu_F - \mu_G)^2}{4 (\sigma_F^2 + \sigma_G^2)} \right) \right]
\]

with $F = N(\mu_F, \sigma_F^2)$ and $G = N(\mu_G, \sigma_G^2)$.

Assembling the above elements, the net inferential benefit of estimating the null distribution is

\[
B(d_1) = A(d_1) + R = I_{1/2}\left(\hat{F}_0 \,\middle\|\, F_0\right)
- I_{1/2}\left(\hat{F}_0^{(d_1)} \,\middle\|\, \hat{F}_0\right) \tag{7}
\]

if there are $d_1$ affected features, where $F_0 = N(0, 1)$ and where the ancillarity $A(d_1)$ and
relevance $R$ are given by equations (5) and (6) with $I = I_{1/2}$. Basing inference on the
estimated null distribution is effective to the extent that $B(d_1) > 0$. Fig. 6 uses the gene
expression data to illustrate the use of $B(d_1)$ to determine whether to rely on the estimated null
distribution $\hat{F}_0$ or on the assumed null distribution $F_0$ for inference.
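
The order-1/2 relative information between two normal distributions, and the resulting net benefit
score, can be computed with a short Python sketch. The function names `renyi_half` and `net_benefit`
are hypothetical, each distribution is passed as an assumed `(mean, standard deviation)` pair, and
the parameters of the re-estimated null (with the $d_1$ most extreme confidence levels replaced) are
taken as given rather than computed, since the maximum-likelihood null estimator is not reproduced
here.

```python
import math


def renyi_half(mu_f, sd_f, mu_g, sd_g):
    """Order-1/2 Renyi divergence, in bits, between N(mu_f, sd_f^2) and N(mu_g, sd_g^2)."""
    var_sum = sd_f ** 2 + sd_g ** 2
    # Natural log of the integral of sqrt(f * g) for two normal densities.
    log_affinity = (0.5 * math.log(2.0 * sd_f * sd_g / var_sum)
                    - (mu_f - mu_g) ** 2 / (4.0 * var_sum))
    return -2.0 * log_affinity / math.log(2.0)


def net_benefit(estimated, reestimated, assumed=(0.0, 1.0)):
    """Net benefit: relevance of the estimated null minus its nonancillarity.

    estimated   -- (mean, sd) of the estimated null distribution
    reestimated -- (mean, sd) of the null re-estimated after replacing the most
                   extreme confidence levels (supplied by the caller)
    assumed     -- (mean, sd) of the assumed null, N(0, 1) by default
    """
    relevance = renyi_half(*estimated, *assumed)
    nonancillarity = renyi_half(*reestimated, *estimated)
    return relevance - nonancillarity
```

With made-up values such as `net_benefit((0.1, 1.5), (0.1, 1.4))`, a positive result would favor the
estimated null and a negative result the assumed null, in line with the criterion $B(d_1) > 0$.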



5 Discussion 

Whereas most adjustments for multiple comparisons are aimed at minimizing net loss incurred
over a series of decisions optimized over the sample space rather than at weighing evidence in a
particular data set for a hypothesis, adjustments resulting from estimation of the distribution of test
statistics under the null hypothesis are appropriate for all forms of frequentist hypothesis testing
(§1.1). A form seldom considered in non-Bayesian contexts is that of making coherent decisions by
minimizing loss averaged over the parameter space. Taking a step toward filling this gap, Section
3.2 provides a loss function suitable for genome-scale screening rather than for confirmatory testing
and illustrates its application to detecting evidence of gene upregulation or downregulation in
microarray data.

[Figure 6 plot: horizontal axis, number of affected features.]

Figure 6: The nonancillarity $-A(d_1)$ versus the hypothetical number $d_1$ of affected features. The
gray horizontal line is the relevance $R$ of null estimation and thus indicates the point at which
conditioning on the estimate goes from beneficial ($|A(d_1)| < R$) to deleterious ($|A(d_1)| > R$)
according to equation (7). The data set, model, and null distribution estimator are those of
Example 7 and Figs. 1, 2, and 3.



Simulations measured the extent to which estimating the null distribution improves conditional
inference in an extreme multiple-comparisons setting such as that of finding evidence for differential
gene expression in microarray measurements (§4.1). While confidence levels of evidence tended to
err on the conservative side under both the estimated and assumed null distributions, conservative
error quantified by numbers of confidence levels in [1%, 99%] compared to the confidence levels
conditional on the precision statistic $\varsigma_k$ was excessive under the assumed null but
negligible under the estimated null (Fig. 4). (Since the same pattern of relative conditional
performance was obtained by more realistically setting $\log \varsigma_k$ equal to a variate that is
independent and uniformly distributed between $\log(1/2)$ and $\log(2)$, those results were not
displayed.) Due to the heavy tails of the marginal distribution of pre-transformed confidence levels
under the null hypothesis, transforming them to satisfy that distribution under the assumed null
increased their conditional conservatism, resulting in about the same performance of estimated and
assumed null distributions with respect to the affected features. The case of the unaffected features
is more interesting: the assumed null distribution, which after the transformation is marginally
exact and hence valid for Neyman-Pearson hypothesis testing, incurs 35% more conservative error than
the estimated null distribution (Fig. 5). Thus, the use of the marginal null distribution in place of
$N(0, 1)$, the distribution conditional on the central component of the mixture, substantially
increases conservative error irrespective of whether the null is assumed or estimated. These results
suggest that confidence levels better serve inductive inference when derived from a plausible
conditional null distribution than from the marginal distribution even though the latter conforms to
the Neyman-Pearson standard. This recommendation reinforces the conditionality principle, which is
appropriate for the inferential goal of significance testing as opposed to the various
decision-theoretic motivations behind Neyman-Pearson testing (§1.1).



Since the findings of the simulation study do not guarantee the effectiveness of an estimated
null distribution $\hat{F}_0$ over the assumed null distribution $F_0$, Section 4.2 gave an
information-theoretic score for determining whether to depend on $\hat{F}_0$ in place of $F_0$ for
inference on the basis of a particular data set. The score serves as a tool for discovering whether
the ancillarity and inferential relevance of $\hat{F}_0$ call for its use in inference and decision
making.
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